{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7d1067a3",
   "metadata": {},
   "source": [
    "# Data Cleaning and Descriptive Analysis\n",
    "\n",
    "This notebook walks through basic data cleaning (finding and imputing missing values) and descriptive analysis (summary statistics, a histogram, a scatter plot, and a grouped mean) on a clinical diabetes dataset. The dataset link is provided for reproducibility."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5222cb20",
   "metadata": {},
   "source": [
    "## Dataset Information and Download Links\n",
    "\n",
    "The examples in this notebook use a **Diabetes dataset**, which can be downloaded from the following source:\n",
    "\n",
    "1. **Kaggle:**\n",
    "   - [Diabetes Dataset - Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data)\n",
    "   - This dataset contains clinical measurements used for diabetes prediction and analysis.\n",
    "\n",
    "### Dataset Attributes\n",
    "\n",
    "- **Pregnancies**: Number of pregnancies.\n",
    "- **Glucose**: Plasma glucose concentration, measured 2 hours into an oral glucose tolerance test.\n",
    "- **BloodPressure**: Diastolic blood pressure (mm Hg).\n",
    "- **SkinThickness**: Triceps skinfold thickness (mm).\n",
    "- **Insulin**: 2-hour serum insulin (mu U/ml).\n",
    "- **BMI**: Body mass index (weight in kg / (height in m)^2).\n",
    "- **DiabetesPedigreeFunction**: Diabetes pedigree function.\n",
    "- **Age**: Age of the patient (years).\n",
    "- **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic).\n",
    "\n",
    "### Usage Notes\n",
    "\n",
    "- Preprocess the dataset before analysis (e.g., handle missing values and normalize if needed).\n",
    "- Missing measurements in this dataset are commonly recorded as 0 rather than left blank, so a 0 in Glucose, BloodPressure, SkinThickness, Insulin, or BMI should be checked and treated as missing.\n",
    "- Refer to the [dataset documentation](https://www.kaggle.com/datasets/mathchi/diabetes-data) for more information."
   ]
  },
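  {
   "cell_type": "markdown",
   "id": "a1c4e7b2",
   "metadata": {},
   "source": [
    "The normalization mentioned in the usage notes can take several forms. The snippet below is a minimal sketch, assuming z-score standardization is what is wanted (min-max scaling is a common alternative). The toy values are hypothetical and are not drawn from the dataset.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy values standing in for two clinical columns\n",
    "toy = pd.DataFrame({'Glucose': [90.0, 110.0, 130.0], 'BMI': [22.0, 30.0, 38.0]})\n",
    "\n",
    "# Z-score standardization: subtract each column's mean,\n",
    "# then divide by its sample standard deviation\n",
    "standardized = (toy - toy.mean()) / toy.std()\n",
    "\n",
    "print(standardized)\n",
    "```\n",
    "\n",
    "On the real data, the same expression would apply to the feature columns only; `Outcome` is a class label and should not be rescaled."
   ]
  },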
  {
   "cell_type": "markdown",
   "id": "f7276065",
   "metadata": {},
   "source": [
    "## Data Cleaning: Handling Missing Values\n",
    "\n",
    "Missing values can bias summary statistics and distort plots. This section counts the missing values in each column and fills them with the column median."
   ]
  },
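  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57760f30",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Load the dataset (replace with your file path)\n",
    "data = pd.read_csv(r'C:\\Path\\to\\diabetes.csv')\n",
    "\n",
    "# A 0 in these columns is not a plausible measurement; this dataset is\n",
    "# commonly described as using 0 as a placeholder for a missing value,\n",
    "# so convert those zeros to NaN before counting (an assumption to verify)\n",
    "zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']\n",
    "data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)\n",
    "\n",
    "# Count missing values per column\n",
    "print(\"Missing values before cleaning:\")\n",
    "print(data.isna().sum())\n",
    "\n",
    "# Fill missing values with each numeric column's median\n",
    "data = data.fillna(data.median(numeric_only=True))\n",
    "\n",
    "print(\"Missing values after cleaning:\")\n",
    "print(data.isna().sum())"
   ]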
  },
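  {
   "cell_type": "markdown",
   "id": "b2d5f8c3",
   "metadata": {},
   "source": [
    "The choice of the median over the mean matters when a column contains extreme readings. The following is a minimal sketch on a hypothetical toy column (not values from the dataset) that compares the two fill values:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical insulin-like readings with one extreme value and one gap\n",
    "toy = pd.Series([80.0, 90.0, 100.0, 600.0, np.nan], name='Insulin')\n",
    "\n",
    "# The mean is pulled upward by the single 600 reading\n",
    "mean_fill = toy.fillna(toy.mean())\n",
    "\n",
    "# The median depends only on the middle of the sorted values\n",
    "median_fill = toy.fillna(toy.median())\n",
    "\n",
    "print('Filled with mean:  ', mean_fill.iloc[-1])\n",
    "print('Filled with median:', median_fill.iloc[-1])\n",
    "```\n",
    "\n",
    "Here the mean-based fill lands well above three of the four observed readings, while the median-based fill stays among the typical ones."
   ]
  },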
  {
   "cell_type": "markdown",
   "id": "df3d283a",
   "metadata": {},
   "source": [
    "## Descriptive Statistics: Summarizing the Data\n",
    "\n",
    "Descriptive statistics summarize the center, spread, and range of each column. `DataFrame.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed7b65f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Summary statistics for numerical columns\n",
    "summary_stats = data.describe()\n",
    "print(summary_stats)\n",
    "        "
   ]
  },
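  {
   "cell_type": "markdown",
   "id": "c3e6a9d4",
   "metadata": {},
   "source": [
    "The default summary does not include the interquartile range or skewness, which are often useful for clinical measurements with long tails. As a minimal sketch on hypothetical toy values (not taken from the dataset), both can be computed directly:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical right-skewed values\n",
    "toy = pd.Series([1.0, 2.0, 3.0, 4.0, 20.0])\n",
    "\n",
    "# Interquartile range: spread of the middle 50% of the values\n",
    "q1, q3 = toy.quantile(0.25), toy.quantile(0.75)\n",
    "iqr = q3 - q1\n",
    "\n",
    "# Sample skewness: positive when the right tail is longer\n",
    "skewness = toy.skew()\n",
    "\n",
    "print('IQR:', iqr)\n",
    "print('Skewness:', skewness)\n",
    "```\n",
    "\n",
    "The same calls work on a column such as `data['Insulin']` once the dataset is loaded."
   ]
  },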
  {
   "cell_type": "markdown",
   "id": "d5582a0c",
   "metadata": {},
   "source": [
    "## Univariate Analysis: Distribution of Glucose Levels\n",
    "\n",
    "A histogram with a kernel density estimate (KDE) overlay shows the spread, central tendency, and skew of the glucose levels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e43cdc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Histogram for glucose levels\n",
    "sns.histplot(data['Glucose'], kde=True)\n",
    "plt.title(\"Distribution of Glucose Levels\")\n",
    "plt.xlabel(\"Glucose\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff332b43",
   "metadata": {},
   "source": [
    "## Bivariate Analysis: Glucose vs BMI\n",
    "\n",
    "A scatter plot shows the relationship between two numerical variables. Here BMI is on the x-axis and glucose on the y-axis, and each point is one patient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "225e14ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Scatter plot for Glucose vs BMI\n",
    "sns.scatterplot(x=data['BMI'], y=data['Glucose'])\n",
    "plt.title(\"Glucose vs BMI\")\n",
    "plt.xlabel(\"BMI\")\n",
    "plt.ylabel(\"Glucose\")\n",
    "plt.show()\n",
    "        "
   ]
  },
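  {
   "cell_type": "markdown",
   "id": "d4f7b0e5",
   "metadata": {},
   "source": [
    "A scatter plot shows the shape of a relationship but not its strength. One way to quantify it is the Pearson correlation coefficient, sketched below on hypothetical toy values (not taken from the dataset):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical paired measurements\n",
    "toy = pd.DataFrame({\n",
    "    'BMI': [20.0, 25.0, 30.0, 35.0],\n",
    "    'Glucose': [85.0, 100.0, 95.0, 130.0],\n",
    "})\n",
    "\n",
    "# Series.corr computes the Pearson correlation by default;\n",
    "# values near +1 or -1 indicate a strong linear relationship\n",
    "r = toy['BMI'].corr(toy['Glucose'])\n",
    "\n",
    "print('Pearson r:', round(r, 3))\n",
    "```\n",
    "\n",
    "On the real data, `data['BMI'].corr(data['Glucose'])` would give the corresponding value. Pearson correlation captures only linear association, so it complements the plot rather than replacing it."
   ]
  },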
  {
   "cell_type": "markdown",
   "id": "cf22e0b8",
   "metadata": {},
   "source": [
    "## Grouping and Aggregation: Mean Glucose by Outcome\n",
    "\n",
    "Grouping by `Outcome` and averaging `Glucose` compares the mean glucose level of the non-diabetic (0) and diabetic (1) groups."
   ]
  },
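  {
   "cell_type": "markdown",
   "id": "e5a8c1f6",
   "metadata": {},
   "source": [
    "To make the grouping step concrete, the following minimal sketch uses a hypothetical toy table (not rows from the dataset) and computes several aggregates per group in one call, since a mean alone hides the group sizes:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical rows: two non-diabetic (0) and three diabetic (1) patients\n",
    "toy = pd.DataFrame({\n",
    "    'Outcome': [0, 0, 1, 1, 1],\n",
    "    'Glucose': [90.0, 100.0, 140.0, 150.0, 160.0],\n",
    "})\n",
    "\n",
    "# Split rows by Outcome, then compute count, mean, and median of Glucose\n",
    "summary = toy.groupby('Outcome')['Glucose'].agg(['count', 'mean', 'median'])\n",
    "\n",
    "print(summary)\n",
    "```\n",
    "\n",
    "Reporting the count next to the mean shows how many patients each average is based on."
   ]
  },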
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a4854b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Group by outcome and calculate mean glucose\n",
    "mean_glucose_by_outcome = data.groupby('Outcome')['Glucose'].mean()\n",
    "print(mean_glucose_by_outcome)\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
